Abstract

This paper considers multiple regression procedures for analyzing the relationship between a response 
variable and a vector of d covariates in a nonparametric setting where both tuning parameters and the 
number of covariates need to be selected. We introduce an approach which handles the dilemma that 
with high dimensional data the sparsity of data in regions of the sample space makes estimation of 
nonparametric curves and surfaces virtually impossible. This is accomplished by abandoning the goal 
of trying to estimate true underlying curves and instead estimating measures of dependence that can 
determine important relationships between variables. These dependence measures are based on local 
parametric fits on subsets of the covariate space that vary in both dimension and size within each 
dimension. The subset which maximizes a signal to noise ratio is chosen, where the signal is a local 
estimate of a dependence parameter which depends on the subset dimension and size, and the noise is an 
estimate of the standard error (SE) of the estimated signal. This approach of choosing the window size to 
maximize a signal to noise ratio lifts the curse of dimensionality because for regions with sparsity of data 
the SE is very large. It corresponds to asymptotically maximizing the probability of correctly finding 
non-spurious relationships between covariates and a response or, more precisely, maximizing asymptotic 
power among a class of asymptotic level $\alpha$ t-tests indexed by subsets of the covariate space. Subsets that
achieve this goal are called features. We investigate the properties of specific procedures based on the
preceding ideas using asymptotic theory and Monte Carlo simulations and find that within a selected
dimension, the volume of the optimally selected subset does not tend to zero as $n \to \infty$ unless the volume
of the subset of the covariate space where the response depends on the covariate vector tends to zero.

1 Introduction



In this paper, we analyze the relationship between a response variable $Y$ and a vector of
covariates $\mathbf{X} = (X_1, X_2, \ldots, X_d)^T$ by using "signal to noise" descriptive measures of
dependencies between $Y$ and $\mathbf{X}$. This approach addresses a dilemma of nonparametric statistical
analysis (the curse of dimensionality), that with high dimensional data, sparsity of data in 
large regions of the sample space makes estimation of nonparametric curves and surfaces 
virtually impossible. On the other hand, if the goal is to find regions of strong dependence 
between variables, parametric methods may provide evidence for such relationships. This 
paper develops hybrid methods between parametric and nonparametric procedures that are 
designed to lift the curse of dimensionality by uniting the goal of analyzing dependencies 
between Y and the X's with the goal of finding a good model. 

These methods are based on measures of dependencies between variables rather than 
on the estimation of some "true" curve or surface. Such measures can, for instance, be 
based on the currently available procedures that are defined in terms of tuning parameters. 
However, instead of asking what value of the tuning parameter will bring us closest to the 
"truth" , we ask what value of the tuning parameter will give us the best chance of finding 
a relationship between the variables, if it exists. For instance, consider the case where the 
tuning parameter is the window size $\mathbf{h} = (h_1, \ldots, h_d)^T$ of local regions in $\mathbb{R}^d$ within which we
do local parametric fits of $Y$ to $\mathbf{X}$. Our approach consists of finding the "best" window size
h by maximizing a local signal to noise ratio where the signal is an estimate of a measure of 
dependence and the noise is the estimated standard error (SE) of that estimate. By dividing 
by the noise we lift the curse of dimensionality because the noise will be very large if the 
tuning parameter is such that the local region in the sample space is nearly empty. 

Here we focus on the problem of exploring the relationship between the response $Y$ and
a covariate of interest $X_1$ while controlling for covariates $X_2, X_3, \ldots, X_d$. Applied research
abounds with situations where one is interested in the relationship between two variables 
while controlling for many other factors. The usual approach to dealing with the curse 
of dimensionality is to assume either a simple parametric relationship (usually linear) or a 
nonparametric but additive relationship between the covariates and response. Our approach 
allows for nonparametric modeling of the interaction effects and hence can find more subtle
patterns in the data.

Our analysis methods begin with a reduction step, a procedure to analyze the multivariate 
data in a family of subsets of the covariate space, subsets that vary in both dimension and size 
within each dimension. This step is motivated by the question "Over which subsets of the 
covariates do these data provide evidence of a significant relationship between the response 
and the covariate of interest?" Once these subsets (called features) are extracted from the 
data, the analysis can go in different directions depending on the objective. Inspection of 
the features is itself a useful exploratory data analysis tool. Scale space analysis is possible 
by varying the minimum feature size. 

The idea of using the signal to noise ratio to choose a good procedure is motivated by the result
(Pitman (1948), Pitman (1979), Serfling (1980), Lehmann (1999)) that the asymptotic power
of asymptotically normal test statistics $T_n$ for testing $H_0\colon \theta = \theta_0$ versus $H_1\colon \theta > \theta_0$ when $\theta$
is contiguous to $\theta_0$ can be compared by considering their efficacies, defined as

$$\mathrm{EFF}_0(T_n) = \frac{\left.\frac{\partial}{\partial\theta} E_\theta(T_n)\right|_{\theta=\theta_0}}{\mathrm{SD}_{\theta_0}(T_n)}, \tag{1}$$

or the pre-limit version

$$\mathrm{EFF}_\theta(T_n) = \frac{E_\theta(T_n) - E_{\theta_0}(T_n)}{\mathrm{SD}_\theta(T_n)}. \tag{2}$$

See Doksum and Schafer (2006), who considered the single covariate case and selected subsets
using estimates of the nonparametric efficacy

$$\mathrm{EFF}(T_n) = \frac{E_P(T_n) - E_{H_0}(T_n)}{\mathrm{SD}(T_n)}, \tag{3}$$



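where $H_0$ is a specified independence property, $P$ is the probability distribution of $(\mathbf{X}, Y)$,
and $\mathrm{SD}(T_n)$ is either $\mathrm{SD}_P(T_n)$ or $\mathrm{SD}_{H_0}(T_n)$. Let $\mathrm{SE}(T_n)$ denote a consistent estimator for
$\mathrm{SD}(T_n)$ in the sense that $\mathrm{SE}(T_n)/\mathrm{SD}(T_n) \xrightarrow{P} 1$. We refer to estimates

$$\widehat{\mathrm{EFF}} = \frac{T_n - E_{H_0}(T_n)}{\mathrm{SE}(T_n)} \tag{4}$$

of $\mathrm{EFF}(T_n)$ as a signal to noise ratio or t-statistic. If $T_n \xrightarrow{P} T(P)$ for some functional
$T(P) \equiv \theta$, then $\widehat{\mathrm{EFF}}$ resembles the familiar Wald test statistic

$$t_W = \frac{\hat\theta_n - \theta_0}{\mathrm{SE}(\hat\theta_n)}. \tag{5}$$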



By Slutsky's Theorem, $\widehat{\mathrm{EFF}} \to N(0, 1)$ in general, as is the case of $t_W$ for parametric
models. We could refer to "$\widehat{\mathrm{EFF}} \to N(0, 1)$" as the Wald-Slutsky phenomenon. In some
frameworks, if we consider $(\widehat{\mathrm{EFF}})^2$, this is related to the Wilks phenomenon (Fan, Zhang, and Zhang
(2001)). By not squaring $\widehat{\mathrm{EFF}}$ we have a much simpler bias problem that makes it possible
to address new optimality questions.

Most work in the nonparametric testing literature starts with a class of tests depending 
on a bandwidth h (or other tuning parameter) which tends to zero and then finds the fastest 
possible rate that an alternative can converge to the null hypothesis and still be detectable by 
one of the tests in the given class. These rates and alternatives depend on $\mathbf{h}$ and the sample
size $n$. See, for instance, the development in Section 3.2 of Fan, Zhang, and Zhang (2001).
We consider alternatives not depending on $\mathbf{h}$ and ask what $\mathbf{h}$ maximizes the probability of
detecting the alternative. We find that this $\mathbf{h}$ does not tend to zero unless the volume of the
set on which $E(Y \mid \mathbf{X})$ is nonconstant tends to zero as $n \to \infty$.

This paper is organized as follows. Section 2 defines and motivates our signal to noise
criterion and Section 3 develops asymptotic properties. Section 4 describes how we incorporate
variable selection into the procedure for bandwidth selection. Section 5 explains how
critical values can be approximated via simulating the distribution of the criterion under
random permutations of the data. Section 6 shows results from Monte Carlo simulations
and analysis of real data.



2 Local Efficacy in Nonparametric Regression 

We consider $(\mathbf{X}_1, Y_1), (\mathbf{X}_2, Y_2), \ldots, (\mathbf{X}_n, Y_n)$ i.i.d. as $(\mathbf{X}, Y) \sim P$, $\mathbf{X} \in \mathbb{R}^d$, $Y \in \mathbb{R}$, and
write $Y = \mu(\mathbf{X}) + \varepsilon$ where $\mu(\mathbf{X}) = E(Y \mid \mathbf{X})$ is assumed to exist and where $\varepsilon = Y - \mu(\mathbf{X})$.
We will focus on finding dependencies between $Y$ and the covariate of interest $X_1$
for $\mathbf{X}$ in a neighborhood of a given covariate vector $\mathbf{x}_0$ from a targeted unit (e.g. patient,
component, DNA sequence). We want to know if a perturbation of $x_1$ will affect the mean
response for units with covariate values near $\mathbf{x}_0$. Formally, we test $H_0^{(1)}$: "$X_1$ is independent
of $X_2, X_3, \ldots, X_d, Y$" versus $H_1^{(1)}$: "$\mu(\mathbf{x})$ is not constant as a function of $x_1 \in \mathbb{R}$."

Our test statistics will be based on linear fits for x restricted to subregions of the sample 
space. The subregion that best highlights the dependencies between Y and covariates will 
be determined by maximizing signal to noise ratios. We illustrate such procedures and their 
properties by first considering a linear model fit locally over the neighborhood $N_{\mathbf{h}}(\mathbf{x}_0)$ of $\mathbf{x}_0$
where $\mathbf{h} = (h_1, h_2, \ldots, h_d)$ are bandwidths in each of the $d$ dimensions. The true relationship
between $Y$ and $\mathbf{X}$ is such that $E[Y \mid \mathbf{X} = \mathbf{x}] = \mu(\mathbf{x})$ and $\mathrm{Var}(Y \mid \mathbf{X} = \mathbf{x}) = \sigma^2$ are unknown.
The local linear model gives fitted values as a function of $\mathbf{x} \in N_{\mathbf{h}}(\mathbf{x}_0)$, denoted

$$\hat\mu_L(\mathbf{x}) = \hat\beta_0 + \sum_{j=1}^{d} \hat\beta_j (x_j - x_{0j}), \tag{6}$$

where $\hat\beta_j = \hat\beta_j(\mathbf{h})$ are sample coefficients depending on $\mathbf{x}_0$ and $\mathbf{h}$. Thus, for coefficients
$\beta_j = \beta_j(\mathbf{h})$ depending on $\mathbf{x}_0$ and $\mathbf{h}$,

$$\mu_L(\mathbf{x}) = E[\hat\mu_L(\mathbf{x})] = \beta_0 + \sum_{j=1}^{d} \beta_j (x_j - x_{0j}). \tag{7}$$

We let $\hat\beta_1(\mathbf{h})$ be the test statistic $T_n$ for testing $H_0^{(1)}$ versus $H_1^{(1)}$ and develop some of the
properties of the efficacy and estimated efficacy for this test statistic. Let $V(\mathbf{h}) = P(\mathbf{X} \in
N_{\mathbf{h}}(\mathbf{x}_0))$ and assume throughout that $0 < V(\mathbf{h}) < 1$.

2.1 The Signal 

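Formally, the signal is $E(\hat\beta_1(\mathbf{h})) = \beta_1(\mathbf{h})$ where $\boldsymbol\beta(\mathbf{h}) = (\beta_0(\mathbf{h}), \beta_1(\mathbf{h}), \ldots, \beta_d(\mathbf{h}))^T$ is the
coefficient vector of the best local linear fit to $Y$. That is

$$\boldsymbol\beta(\mathbf{h}) = \arg\min\{E^{\mathbf{h}}[Y - (a + \mathbf{b}^T \mathbf{X})]^2 : a \in \mathbb{R},\ \mathbf{b} \in \mathbb{R}^d\} \tag{8}$$

where $E^{\mathbf{h}}$ is expected value for the conditional distribution $P^{\mathbf{h}}$ of $(\mathbf{X}, Y)$ given $\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)$.
With $\mathrm{Var}^{\mathbf{h}}$ and $\mathrm{Cov}^{\mathbf{h}}$ also being conditional,

$$(\beta_1(\mathbf{h}), \ldots, \beta_d(\mathbf{h}))^T = \Sigma_X^{-1}(\mathbf{h})\, \Sigma_{XY}(\mathbf{h}) \tag{9}$$

where $\Sigma_X(\mathbf{h}) = \mathrm{Var}^{\mathbf{h}}(\mathbf{X})$ and $\Sigma_{XY}(\mathbf{h}) = \mathrm{Cov}^{\mathbf{h}}(\mathbf{X}, Y)$.

Similarly, $\hat{\boldsymbol\beta}(\mathbf{h})$ is the local linear least squares estimator

$$\hat{\boldsymbol\beta}(\mathbf{h}) = (X_{\mathbf{h}}^T X_{\mathbf{h}})^{-1} X_{\mathbf{h}}^T Y_{\mathbf{h}} \tag{10}$$

where $X_{\mathbf{h}}$ is the design matrix $(X_{ij})$, $X_{i0} = 1$, $0 \le j \le d$, $i \in \mathcal{I}$ with $\mathcal{I} = \{k : \mathbf{X}_k \in N_{\mathbf{h}}(\mathbf{x}_0)\}$,
and $Y_{\mathbf{h}} = \{Y_i : i \in \mathcal{I}\}$. It follows that conditionally given $\mathcal{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)$,

$$\mathrm{Var}(\hat{\boldsymbol\beta}(\mathbf{h}) \mid \mathcal{X}) = \sigma^2 (X_{\mathbf{h}}^T X_{\mathbf{h}})^{-1}. \tag{11}$$
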
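By writing $X_{\mathbf{h}}^T X_{\mathbf{h}} = X_D^T W_{\mathbf{h}} X_D$ where $W_{\mathbf{h}} = \mathrm{diag}(1[\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)])$, $1 \le i \le n$, with
$1[\cdot]$ equal to 1 when $\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)$ and 0 otherwise, and $X_D$ is the full design matrix
$(X_{ij})_{n \times (d+1)}$, we see that by the law of large numbers,

$$n^{-1} X_{\mathbf{h}}^T X_{\mathbf{h}} \to E(\mathbf{X}_D \mathbf{X}_D^T\, 1[\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)]) = V(\mathbf{h})\, E(\mathbf{X}_D \mathbf{X}_D^T \mid \mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)) = V(\mathbf{h})\, \Sigma_{1X}(\mathbf{h}). \tag{12}$$

Because $E(\hat\beta_1(\mathbf{h}) \mid \mathcal{X}) \to \beta_1(\mathbf{h})$ as $n \to \infty$, this shows that the estimated local signal $\hat\beta_1(\mathbf{h})$
is conditionally consistent given $\mathcal{X}$ as $n \to \infty$ and $d \to \infty$ when

$$n^{-1} \sigma^{11}(\mathbf{h})\, V^{-1}(\mathbf{h}) \to 0, \tag{13}$$

where $\sigma^{jk}(\mathbf{h})$ is defined by $(\sigma^{jk}(\mathbf{h})) = \Sigma_{1X}^{-1}(\mathbf{h})$, $0 \le j \le d$, $0 \le k \le d$.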
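
As an illustration of the local least squares estimator in Equation (10), the following is a minimal Python sketch, assuming a rectangular neighborhood $|X_{ij} - x_{0j}| \le h_j$ and covariates centered at $\mathbf{x}_0$ as in Equation (6). The function name `local_linear_fit` and the simulated data are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def local_linear_fit(X, y, x0, h):
    """Least squares fit of y on (1, X - x0) using only the rows of X
    that fall in the rectangular window |X_ij - x0_j| <= h_j for all j."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    inside = np.all(np.abs(X - x0) <= h, axis=1)      # indicator of the window
    Xh = np.column_stack([np.ones(inside.sum()), X[inside] - x0])  # local design
    beta_hat, *_ = np.linalg.lstsq(Xh, y[inside], rcond=None)
    return beta_hat, inside

# Simulated example (assumed model, not from the paper): a globally linear mean.
rng = np.random.default_rng(0)
n, d = 2000, 3
X = rng.uniform(0.0, 1.0, size=(n, d))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n)
x0 = np.full(d, 0.5)
h = np.full(d, 0.3)
beta_hat, inside = local_linear_fit(X, y, x0, h)
```

Here `beta_hat[1]` plays the role of the estimated local signal $\hat\beta_1(\mathbf{h})$, and `inside.mean()` is an empirical analogue of $V(\mathbf{h})$.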

2.2 The Noise 

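The denominator of the efficacy, the noise, is the asymptotic standard deviation $\sigma_0(\hat\beta_1(\mathbf{h}))$
of $\hat\beta_1(\mathbf{h})$ under $H_0$. We derive a formula and develop a consistent estimator in what follows.
Let $n_{\mathbf{h}}$ be the number of $\mathbf{X}_i$ that fall in $N_{\mathbf{h}}(\mathbf{x}_0)$. Note that

$$\sqrt{n}\,[\hat{\boldsymbol\beta}(\mathbf{h}) - \boldsymbol\beta(\mathbf{h})] = \sqrt{n}\, \hat\Sigma_{1X}^{-1} [n_{\mathbf{h}}^{-1} X_{\mathbf{h}}^T Y_{\mathbf{h}} - \hat\Sigma_{1X}\, \boldsymbol\beta(\mathbf{h})], \tag{14}$$

where $\hat\Sigma_{1X} = n_{\mathbf{h}}^{-1}(X_{\mathbf{h}}^T X_{\mathbf{h}}) \to \Sigma_{1X}(\mathbf{h})$ by Equation (12) and $[n_{\mathbf{h}}/nV(\mathbf{h})] \to 1$. By Slutsky's
Theorem, $\sqrt{n}\,[\hat{\boldsymbol\beta}(\mathbf{h}) - \boldsymbol\beta(\mathbf{h})]$ has the same asymptotic distribution as $\sqrt{n}\,\{n_{\mathbf{h}}^{-1} \Sigma_{1X}^{-1} X_{\mathbf{h}}^T \mathbf{e}_{\mathbf{h}}\}$
where $\mathbf{e}_{\mathbf{h}} = Y_{\mathbf{h}} - X_{\mathbf{h}} \boldsymbol\beta(\mathbf{h})$. Note that by Equation (8), $E^{\mathbf{h}}(X_{\mathbf{h}}^T \mathbf{e}_{\mathbf{h}}) = 0$, and with $e_i =
Y_i - \boldsymbol\beta^T(\mathbf{h}) \mathbf{X}_i$,

$$n_{\mathbf{h}}^{-1} X_{\mathbf{h}}^T \mathbf{e}_{\mathbf{h}} = (n/n_{\mathbf{h}}) \Bigl( n^{-1} \sum_{i=1}^{n} e_i X_{ij}\, 1(\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)) : 0 \le j \le d \Bigr). \tag{15}$$

Thus by Slutsky's Theorem and the Central Limit Theorem

$$\sqrt{n}\,(n_{\mathbf{h}}^{-1} X_{\mathbf{h}}^T \mathbf{e}_{\mathbf{h}}) \to N(0, V^{-1}(\mathbf{h})\, \Sigma_{Xe}(\mathbf{h})) \tag{16}$$

where

$$\Sigma_{Xe}(\mathbf{h}) = \bigl(E(e^2 X_j X_k \mid \mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0))\bigr)_{(d+1) \times (d+1)} \tag{17}$$

with $e = Y - \boldsymbol\beta(\mathbf{h})^T \mathbf{X}$.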

We have shown the following. 
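Proposition 1. Suppose that $\mathbf{h}$ and $d$ are fixed in $n$, that $V(\mathbf{h}) > 0$, $0 < \mathrm{Var}^{\mathbf{h}}(Y) < \infty$, and
that $\Sigma_{1X}^{-1}(\mathbf{h})$ exists; then

$$\sqrt{n}\,[\hat{\boldsymbol\beta}(\mathbf{h}) - \boldsymbol\beta(\mathbf{h})] \to N\bigl(0,\ V^{-1}(\mathbf{h})\, \Sigma_{1X}^{-1}(\mathbf{h})\, \Sigma_{Xe}(\mathbf{h})\, \Sigma_{1X}^{-1}(\mathbf{h})\bigr). \tag{18}$$

By using Equation (15) and Liapounov's Central Limit Theorem we can allow $d$ and $\mathbf{h}$ to
depend on $n$ when we consider the asymptotic distribution of $\sqrt{n}\,[\hat\beta_1(\mathbf{h}) - \beta_1(\mathbf{h})]$. We have
shown the following.

Proposition 2. Suppose that $\Sigma_{1X}^{-1}(\mathbf{h})$ exists, that $[\sigma^{11}(\mathbf{h})]^2 V(\mathbf{h})$ is bounded away from zero
and that $0 < \mathrm{Var}(Y) < \infty$; then as $n \to \infty$,

$$\sqrt{n}\,[\hat\beta_1(\mathbf{h}) - \beta_1(\mathbf{h})] \to N\bigl(0,\ \sigma^2(\hat\beta_1(\mathbf{h}))\bigr). \tag{19}$$

Because under $H_0^{(1)}$, $X_1$ is independent of $X_2, \ldots, X_d$ and $Y$, if we set $\mu_L(\mathbf{x}) = \beta_0 +
\sum_{j=2}^{d} \beta_j x_j$ and $e = Y - \mu_L(\mathbf{x})$, the asymptotic variance of $\sqrt{n}\,[\hat\beta_1(\mathbf{h}) - \beta_1(\mathbf{h})]$ is

$$\sigma_0^2(\hat\beta_1(\mathbf{h})) \equiv \frac{E_{H_0}^{\mathbf{h}}[Y - \mu_L(\mathbf{X})]^2}{V(\mathbf{h})\, \mathrm{Var}_{H_0}^{\mathbf{h}}(X_1)}. \tag{20}$$

Now $\sigma_0(\hat\beta_1(\mathbf{h}))/\sqrt{n}$ is the noise part of the efficacy.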

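The sample variance $s_1^2(h_1)$ calculated using $\{X_{i1} : X_{i1} \in [x_{01} - h_1, x_{01} + h_1]\}$ is our
estimate of $\sigma_1^2 = \mathrm{Var}_{H_0}^{\mathbf{h}}(X_1)$. It is consistent whenever $nh_1 \to \infty$ as $n \to \infty$. Note that with
$\mu_0(\mathbf{X}) = E_{H_0}^{\mathbf{h}}(Y \mid \mathbf{X})$, the null residual variance $\sigma_0^2(\mathbf{h}^{(-1)}) = E_{H_0}^{\mathbf{h}}(e^2)$ is

$$\sigma_0^2(\mathbf{h}^{(-1)}) = \sigma^2 + E_{H_0}^{\mathbf{h}}[\mu_0(\mathbf{X}) - \mu_L(\mathbf{X})]^2 \equiv \sigma^2 + \sigma_L^2(\mathbf{h}^{(-1)}) \tag{21}$$

where $\mathbf{h}^{(-1)} = (h_2, \ldots, h_d)$ and $\sigma_L^2(\mathbf{h}^{(-1)})$ is the contribution to the variance of the error due
to lack of linear fit to $\mu_0(\mathbf{X})$ under $H_0$. Let $\hat\mu_L^{(-1)}(\mathbf{X}) = \hat\beta_0 + \sum_{j=2}^{d} \hat\beta_j X_j$ be the locally linear
fit based on

$$\mathcal{D}(\mathbf{h}^{(-1)}) = \{(X_{ij}, Y_i) : X_{ij} \in [x_{0j} - h_j, x_{0j} + h_j],\ 2 \le j \le d,\ 1 \le i \le n\}, \tag{22}$$

then a natural estimator for $\sigma_0^2$ is

$$s_0^2(\mathbf{h}^{(-1)}) = (n(\mathbf{h}^{(-1)}) - d)^{-1} \sum_{\mathcal{D}(\mathbf{h}^{(-1)})} [Y_i - \hat\mu_L^{(-1)}(\mathbf{X}_i)]^2, \tag{23}$$

where $n(\mathbf{h}^{(-1)})$ is the number of data points in $\mathcal{D}(\mathbf{h}^{(-1)})$. The following can be shown (see
Appendix).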



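Lemma 1. If $\sigma^2 > 0$, $\sigma_L^2(\mathbf{h}^{(-1)}) < \infty$, then, with $V(\mathbf{h}^{(-1)}) = E(n(\mathbf{h}^{(-1)}))/n$,

$$E_{H_0}\Bigl\{\sum_{\mathcal{D}(\mathbf{h}^{(-1)})} [Y_i - \hat\mu_L^{(-1)}(\mathbf{X}_i)]^2\Bigr\} = [nV(\mathbf{h}^{(-1)}) - d]\, \sigma^2 + nV(\mathbf{h}^{(-1)})\, \sigma_L^2(\mathbf{h}^{(-1)}). \tag{24}$$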



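Thus $s_0^2(\mathbf{h}^{(-1)})$ is a consistent estimator of $\sigma_0^2(\mathbf{h}^{(-1)})$ under $H_0$ whenever $[d/nV(\mathbf{h}^{(-1)})] \to 0$
and $\mathrm{Var}_{H_0}[s_0^2(\mathbf{h})] \to 0$ as $n \to \infty$ and $d \to \infty$. It follows that under these conditions, a
consistent estimate of $\sqrt{n} \times$ noise $= \sigma_1(\mathbf{h}) \equiv \sigma_0(\hat\beta_1(\mathbf{h}))$ is

$$\hat\sigma_1(\mathbf{h}) = \bigl\{s_0^2(\mathbf{h}^{(-1)}) \big/ \hat V(\mathbf{h})\, s_1^2(h_1)\bigr\}^{1/2}. \tag{25}$$

Note that because $\hat\beta_1 \to 0$ under $H_0$, we can use Slutsky's theorem to show that this
estimate is asymptotically equivalent to the usual linear model estimate of $\mathrm{SD}^{\mathbf{h}}(\sqrt{n}\, \hat\beta_1(\mathbf{h}))$
based on data $(\mathbf{X}_i, Y_i)$ with $\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)$.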
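
The noise estimate in Equation (25) can be sketched in Python as follows. This is an illustrative reading of Equations (22), (23), and (25), not the authors' implementation: it assumes rectangular windows, takes $\hat V(\mathbf{h})$ to be the fraction of observations in the window (an assumption), and forms the statistic $\sqrt{n}\,\hat\beta_1(\mathbf{h})/\hat\sigma_1(\mathbf{h})$. The function name `local_t_statistic` and the simulated null model are hypothetical.

```python
import numpy as np

def local_t_statistic(X, y, x0, h):
    """Return (t1, sigma1_hat) for the covariate of interest (column 0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    h = np.asarray(h, dtype=float)
    n, d = X.shape

    in_each = np.abs(X - x0) <= h            # per-coordinate window membership
    in_all = in_each.all(axis=1)             # full window in all d coordinates
    in_rest = in_each[:, 1:].all(axis=1)     # window in coordinates 2..d only

    # Signal: slope of X_1 in the local linear fit over the full window.
    Zh = np.column_stack([np.ones(in_all.sum()), X[in_all] - x0])
    beta_hat, *_ = np.linalg.lstsq(Zh, y[in_all], rcond=None)

    # s_1^2(h_1): sample variance of X_1 over [x01 - h1, x01 + h1].
    s1_sq = X[in_each[:, 0], 0].var(ddof=1)

    # s_0^2: residual variance of the fit of Y on (1, X_2, ..., X_d),
    # with divisor n(h^(-1)) - d as in Equation (23).
    Z0 = np.column_stack([np.ones(in_rest.sum()), X[in_rest, 1:] - x0[1:]])
    coef0, *_ = np.linalg.lstsq(Z0, y[in_rest], rcond=None)
    resid0 = y[in_rest] - Z0 @ coef0
    s0_sq = resid0 @ resid0 / (in_rest.sum() - d)

    V_hat = in_all.mean()                    # assumed estimate of V(h)
    sigma1_hat = np.sqrt(s0_sq / (V_hat * s1_sq))
    t1 = np.sqrt(n) * beta_hat[1] / sigma1_hat
    return t1, sigma1_hat

# Simulated null model (assumed): Y depends on X_2 only, so X_1 has no effect.
rng = np.random.default_rng(2)
n, d = 4000, 3
X = rng.uniform(0.0, 1.0, size=(n, d))
y = 1.0 + X[:, 1] + 0.5 * rng.normal(size=n)
t1, sigma1_hat = local_t_statistic(X, y, [0.5, 0.5, 0.5], [0.3, 0.3, 0.3])
```

Shrinking any $h_j$ reduces `V_hat` and therefore inflates `sigma1_hat`, which is the mechanism by which nearly empty windows are penalized.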



2.3 The Efficacy Criterion 

The usual procedures for selecting bandwidths $\mathbf{h}$ involve minimizing mean squared prediction
or estimation error when predicting the response $Y_0$ or estimating the mean $E(Y_0 \mid \mathbf{x}_0)$
of a case with covariate vector $\mathbf{x}_0$. The disadvantage is that significant relationships may
escape procedures based on $\mathbf{h}$ selected in this way because they will be based on small
bandwidths which lead to large noise. Here we propose to choose the bandwidth to minimize
the probability of Type II error, that is, maximize power among all t-tests with the same
asymptotic significance level. We thereby maximize the probability of finding significant
relationships between $Y$ and a specified covariate. This procedure automatically selects $\mathbf{h}$
to keep the noise small.

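Because $T_n = \hat\beta_1(\mathbf{h})$ is asymptotically normal and $E_{H_0}(T_n) = 0$, we know (Pitman (1948),
Pitman (1979), Serfling (1980), Lehmann (1999)) that for contiguous alternatives the $\mathbf{h}$ that
maximizes the asymptotic power is the $\mathbf{h}$ that maximizes the absolute efficacy, where

$$n^{-1/2}\, \mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) = \frac{\beta_1(\mathbf{h})}{\sigma_1(\mathbf{h})}. \tag{26}$$

Because $\beta_1(\mathbf{h}) = \sum_{j=1}^{d} \sigma^{1j}(\mathbf{h})\, \sigma_{Yj}(\mathbf{h})$, where $\sigma_{Yj}(\mathbf{h}) = \mathrm{Cov}^{\mathbf{h}}(X_j, Y)$, we can write

$$n^{-1/2}\, \mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) = \frac{\sum_{j=1}^{d} \sigma^{1j}(\mathbf{h})\, \sigma_{Yj}(\mathbf{h})}{\sigma_1(\mathbf{h})}. \tag{27}$$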



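The efficacy optimal $\mathbf{h}$ for testing $H_0^{(1)}$ versus $H_1^{(1)}$ is defined by

$$\mathbf{h}_1^{(0)} = (h_{11}^{(0)}, \ldots, h_{1d}^{(0)})^T = \arg\max_{\mathbf{h}} \mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0). \tag{28}$$

For fixed alternatives that do not depend on $n$ with $\beta_1(\mathbf{h}) > 0$, this $\mathbf{h}_1^{(0)}$ will not satisfy
$\min_j\{h_{1j}^{(0)}\} \to 0$ as $n \to \infty$ because $\sigma_1(\mathbf{h}) \to \infty$ as $\min_j\{h_{1j}\} \to 0$.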

Remark 1. The definition of the efficacy optimal $\mathbf{h}$ makes sense when the relationships
between $Y$ and covariates are monotone on $N_{\mathbf{h}}(\mathbf{x}_0)$. To allow for possible relationships such
as

$$Y = \alpha + \cdots + (X_j - x_{0j})^2 + \cdots + \varepsilon \tag{29}$$

in Section 3 we will use neighborhoods of the form $[x_{0j} - (1 - \lambda_j)h_j,\ x_{0j} + (1 + \lambda_j)h_j]$, where
$-1 \le \lambda_j \le 1$, rather than the symmetric intervals $[x_{0j} - h_j,\ x_{0j} + h_j]$. The properties of
the procedures are not changed much by the introduction of the extra tuning parameters
$\lambda_1, \lambda_2, \ldots, \lambda_d$. See Doksum and Schafer (2006) for the case $d = 1$.



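The estimate of $n^{-1/2}\, \mathrm{EFF}_1$ is $n^{-1/2}\, t_1(\mathbf{h}, \mathbf{x}_0)$ where $t_1(\mathbf{h}, \mathbf{x}_0)$ is the t-statistic

$$t_1(\mathbf{h}, \mathbf{x}_0) = \frac{\sqrt{n}\, \hat\beta_1(\mathbf{h})}{\hat\sigma_1(\mathbf{h})} \ \ [\ldots]; \qquad [\ldots] \ \text{otherwise.} \tag{30}$$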

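Write $d_n$ for $d$ to indicate that $d$ may depend on $n$. Using the results of Sections 2.1 and
2.2 we have the following.

Proposition 3. In addition to the assumptions of Proposition 2, assume that as $n \to \infty$,
$[d_n/nV(\mathbf{h})] \to 0$, $[\sigma^{11}(\mathbf{h})]^2 V(\mathbf{h})$ is bounded away from zero, and $\mathrm{Var}_{H_0}[s_0^2(\mathbf{h})] \to 0$; then
$t_1(\mathbf{h}, \mathbf{x}_0)$ is a consistent estimate of $\mathrm{EFF}_1$, in the sense that

$$\frac{t_1(\mathbf{h}, \mathbf{x}_0)}{\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0)} \xrightarrow{P} 1 \quad \text{as } n \to \infty. \tag{31}$$

Corollary 1. Suppose $\mathcal{G}$ (a grid) denotes a finite set of vectors of the form $(\mathbf{h}, \mathbf{x}_0)$, then,
under the assumptions of Proposition 3, the maximizer of $t_1(\mathbf{h}, \mathbf{x}_0)$ over $\mathcal{G}$ is asymptotically
optimal in the sense that

$$\frac{\max\{t_1(\mathbf{h}, \mathbf{x}_0) : (\mathbf{h}, \mathbf{x}_0) \in \mathcal{G}\}}{\max\{\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) : (\mathbf{h}, \mathbf{x}_0) \in \mathcal{G}\}} \xrightarrow{P} 1. \tag{32}$$

Remark 2.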



Hall and Heckman (2000) used the maximum of local t-statistics to test the global
hypothesis that $\mu(\cdot)$ is monotone in the $d = 1$ case. Their estimated local regression slope is
the least squares estimate based on $k$ nearest neighbors and the maximum is over all intervals
with $k$ at least 2 and at most $m$. They established unbiasedness and consistency of their
test rule under certain conditions.
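
The grid maximization described in Corollary 1 can be sketched as follows for the single covariate case. This is a simplified illustration under stated assumptions: it uses the ordinary least squares t-statistic computed inside each window (which Section 2.2 notes is asymptotically equivalent under the null), a hypothetical helper `local_t`, an assumed `min_points` guard that returns 0 for nearly empty windows, and simulated data with a slope confined to a band around $x_0$.

```python
import numpy as np

def local_t(x, y, x0, h, min_points=10):
    """OLS t-statistic for the slope of y on x within [x0 - h, x0 + h].
    Returns 0.0 when the window holds fewer than min_points observations."""
    inside = np.abs(x - x0) <= h
    m = int(inside.sum())
    if m < min_points:
        return 0.0
    xc = x[inside] - x0
    Z = np.column_stack([np.ones(m), xc])
    coef, *_ = np.linalg.lstsq(Z, y[inside], rcond=None)
    resid = y[inside] - Z @ coef
    s2 = resid @ resid / (m - 2)                      # residual variance
    se = np.sqrt(s2 / np.sum((xc - xc.mean()) ** 2))  # standard error of slope
    return coef[1] / se

# Assumed example: the mean rises only for |x - 0.5| <= 0.2 and is flat elsewhere.
rng = np.random.default_rng(3)
n = 2000
x = rng.uniform(0.0, 1.0, size=n)
y = 2.0 * np.clip(x - 0.5, -0.2, 0.2) + 0.5 * rng.normal(size=n)

grid = [0.05, 0.1, 0.2, 0.4]
t_values = {h: local_t(x, y, 0.5, h) for h in grid}
best_h = max(grid, key=lambda h: abs(t_values[h]))    # bandwidth maximizing |t|
best_t = t_values[best_h]
```

The selected `best_h` is the grid point with the largest absolute t-statistic; very small windows are penalized through their large standard errors rather than excluded by rule.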

Because the asymptotic power of the test based on $\hat\beta_1(\mathbf{h})$ tends to one for all $\mathbf{h}$ with
$\beta_1(\mathbf{h}) > 0$ and $|\mathbf{h}| > 0$, it is more interesting to consider Pitman contiguous alternatives $H_{1n}$
where $\beta_1(\mathbf{h})$ depends on $n$ and tends to zero at a rate that ensures that the limiting power
is between $\alpha$ and 1, where $\alpha$ is the significance level. That is, we limit the parameter set to
the set where deciding between $H_0^{(1)}$ and the alternative is difficult. We leave out the cases
where the right decision will be reached for large $n$ regardless of the choice of $\mathbf{h}$.

Now the question becomes: For sequences of contiguous alternatives with $\beta_1(\mathbf{h}) = \beta_{1n}(\mathbf{h}) \to 0$
as $n \to \infty$, what are the properties of $\mathbf{h}_{1n}^{(0)}$? In particular, does the efficacy optimal $\mathbf{h}_{1n}^{(0)}$
tend to zero as $n \to \infty$? The answer depends on the alternative as will be shown below.

2.4 Optimal Bandwidths for Pitman Alternatives with Fixed Support 

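Under $H_0^{(1)}$,

$$\beta_1(\mathbf{h}) = \sum_{j=1}^{d} \beta_{1j}(\mathbf{h}) = \sum_{j=1}^{d} \sigma^{1j}(\mathbf{h})\, \mathrm{Cov}^{\mathbf{h}}(X_j, Y) = 0. \tag{33}$$

We consider sequences of contiguous Pitman alternatives with $\beta_{1j}(\mathbf{h}) \propto c\,n^{-1/2}$, such as

$$Y = \alpha + \gamma_n r(\mathbf{X}) + \varepsilon \tag{34}$$

where $\gamma_n = c\,n^{-1/2}$, $c \ne 0$, and $|\mathrm{Cov}^{\mathbf{h}}(X_j, r(\mathbf{X}))| > b$, for $b > 0$. Here,

$$\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) = c\, \sigma_1^{-1}(\mathbf{h}) \sum_{j=1}^{d} \beta_{1j}(\mathbf{h}) \tag{35}$$

with $\beta_{1j}(\mathbf{h}) = \sigma^{1j}(\mathbf{h})\, \mathrm{Cov}^{\mathbf{h}}(X_j, r(\mathbf{X}))$. As in the case of fixed alternatives, the maximizer $\mathbf{h}_1^{(0)}$
does not satisfy $\min_j\{h_{1j}^{(0)}\} \to 0$ as $n \to \infty$ because $\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) \to 0$ as $\min_j\{h_{1j}\} \to 0$. This
$\mathbf{h}_1^{(0)}$ is asymptotically optimal in the sense of Corollary 1.

2.5 Comparison with Mean Squared Error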

There is a large literature on selecting bandwidths by minimizing mean squared error (MSE). 
Here MSE can be expressed as the following. 
Proposition 4. If $0 < \mathrm{Var}(Y) < \infty$, then

$$E^{\mathbf{h}}\{[\hat\mu_L(\mathbf{x}_0) - \mu(\mathbf{x}_0)]^2 \mid \mathcal{X}\} = \frac{\sigma^2}{n_{\mathbf{h}}} + [\mu_L(\mathbf{x}_0) - \mu(\mathbf{x}_0)]^2 = \frac{\sigma^2}{nV(\mathbf{h})} + [\mu_L(\mathbf{x}_0) - \mu(\mathbf{x}_0)]^2 + o_P(1/n). \tag{36}$$

Proof. MSE is variance plus squared bias where the conditional squared bias is as given.
The conditional variance of the local least squares estimate $\hat\mu_L(\mathbf{x}_0) = \hat\beta_0(\mathbf{h})$ given $\mathcal{X}$ is $\sigma^2/n_{\mathbf{h}}$
where $(n_{\mathbf{h}}/n) \to V(\mathbf{h})$. □

If finding significant dependencies is the goal, we prefer maximizing Equation (27) to
minimizing Equation (36) because the local MSE (36), as well as its global version, focuses
on finding the $\mathbf{h}$ that makes $\hat\mu_L(\mathbf{x})$ close to the true unknown curve $\mu(\mathbf{x})$. Using results of
Ruppert and Wand (1994) and Fan and Gijbels (1996), page 302, we can show that under
regularity conditions (the Hessian of $\mu(\mathbf{x})$ exists), the bandwidths minimizing (36) tend to
zero at the rate $n^{-1/(d+4)}$. By plugging such bandwidths into Equation (27) we see that this
will, in many cases, make it nearly impossible to find dependencies. Various semiparametric
assumptions have brought the rate of convergence to zero of the bandwidths to $n^{-1/5}$, which
still makes it likely that dependencies will be missed. By using (27) we have a simple
method for finding bandwidths that focus on finding dependencies for any type of alternative.
However, there are alternatives where $\mathbf{h} \to 0$ makes sense. We construct these next.



3 Bandwidths for Alternatives with Shrinking Support 

We consider sequences of alternatives where the set $A_n = \{\mathbf{x} : |\mu(\mathbf{x}) - \mu_Y| > 0\}$ is a connected
region whose volume tends to zero as $n \to \infty$. One such set of alternatives is given by Pitman
alternatives of the form

$$K_n\colon\ Y = \alpha + \sum_{j=1}^{d} \gamma_j W_j\Bigl(\frac{X_j - x_{0j}}{\theta_j}\Bigr) + \varepsilon, \tag{37}$$

where $\mathbf{X}$ and $\varepsilon$ are uncorrelated, $\varepsilon$ has mean zero and variance $\sigma^2$, and each $W_j(\cdot)$ has
support $[-1, 1]$. We assume that each $X_j$ is continuous, in which case the hypothesis holds
with probability one when $\gamma_j \theta_j = 0$ for all $j$. We consider models where $\theta_j = \theta_j^{(n)} \to 0$ as
$n \to \infty$, and $\gamma_j$ may or may not depend on $\theta_j$ and $n$. For these alternatives the neighborhood
where $E(Y \mid \mathbf{X})$ is non-constant shrinks to achieve a Pitman balanced model where the power
converges to a limit between the level $\alpha$ and 1 as $n \to \infty$. Note, however, that the alternative
does not depend on $\mathbf{h}$. We are in a situation where "nature" picks the neighborhood size
$\boldsymbol\theta = (\theta_1, \ldots, \theta_d)^T$, and the statistician picks the bandwidth $\mathbf{h}$. This is in contrast to
Blyth (1993), Fan, Zhang, and Zhang (2001), and Ait-Sahalia, Bickel, and Stoker (2001) who let
the alternative depend on $\mathbf{h}$.

We next show that for fixed $\mathbf{h}$ with $|\mathbf{h}| > 0$, $\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) \to 0$ as $\max_j\{\gamma_j \theta_j\} \to 0$ in
model (37). First note that

$$\mathrm{Cov}^{\mathbf{h}}(X_j, Y) = \sum_{k=1}^{d} \gamma_k\, \mathrm{Cov}^{\mathbf{h}}\Bigl(X_j,\ W_k\Bigl(\frac{X_k - x_{0k}}{\theta_k}\Bigr)\Bigr). \tag{38}$$

Next note for $h_k > 0$ fixed, $\gamma_k$ bounded from above, and $\theta_k \to 0$,

$$\gamma_k\, \mathrm{Cov}^{\mathbf{h}}\Bigl(X_j,\ W_k\Bigl(\frac{X_k - x_{0k}}{\theta_k}\Bigr)\Bigr) \to 0, \tag{39}$$

because $W_k((X_k - x_{0k})/\theta_k)$ tends to zero in probability as $\theta_k \to 0$ and any random variable is
uncorrelated with a constant. This heuristic can be verified (see the appendix) by a change
of variable. This result is stated more precisely as follows.

Proposition 5. If $\mathbf{h}$ is fixed with $|\mathbf{h}| > 0$, then as $\max_k\{\gamma_k \theta_k\} \to 0$ in model (37),

(a) $\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) = O(\max_k\{\gamma_k \theta_k\})$ and

(b) $t_1(\mathbf{h}, \mathbf{x}_0) = O_P(\max_k\{\gamma_k \theta_k\})$.

Proposition 5 shows that for model (37), fixed $\mathbf{h}$ leads to small $\mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0)$. Thus we turn
to the $h_1 \to 0$ case. If $h_1 > \theta_1$, then observations $(X_1, Y)$ with $X_1$ outside $[x_{01} - \theta_1,\ x_{01} + \theta_1]$
do not contribute to the estimation of $\beta_1(\mathbf{h})$. Thus choosing a smaller $h_1$ may be better
even though a smaller $h_1$ leads to a larger variance for $\hat\beta_1(\mathbf{h})$. This heuristic is made precise
by the next results which provide conditions where $h_1 = \theta_1$ is the optimal choice among $h_1$
satisfying $h_1 \ge \theta_1$. First, define

$$m_j(W_1) = \int_{-1}^{1} s^j\, W_1(s)\, ds, \quad j = 0, 1, 2. \tag{40}$$



Theorem 1. Assume that $X_1$ is independent of $X_2, \ldots, X_d$, that the density $f_1(\cdot)$ of $X_1$ has
a bounded, continuous second derivative at $x_{01}$, and that $f_1(x_{01}) > 0$. Then, in model (37),
as $\theta_1 \to 0$, $h_1 \to 0$ with $h_1 \ge \theta_1$ and $\mathbf{h}^{(-1)}$ fixed with $|\mathbf{h}^{(-1)}| > 0$,

(a) $\mathrm{Cov}^{\mathbf{h}}(X_1, Y) = \gamma_1 \theta_1^2\, m_1(W_1)/(2h_1) + o(\theta_1^2/h_1)$.

(b) If $m_1(W_1) \ne 0$ and $m_2(W_1) \ne 0$, then

$$n^{-1/2}\, \mathrm{EFF}_1(\mathbf{h}, \mathbf{x}_0) \propto [\sigma^2 + \sigma_L^2(\mathbf{h}^{(-1)})]^{-1/2}\, \gamma_1 \theta_1^2\, h_1^{-3/2}\, m_1(W_1)\, f_1^{1/2}(x_{01})\, V^{1/2}(\mathbf{h}^{(-1)}) \tag{41}$$

which is maximized subject to $h_1 \ge \theta_1$ by $h_1 = \theta_1$.

Theorem 2. Suppose the joint density $f(\mathbf{x})$ of $\mathbf{X}$ has a bounded continuous Hessian at $\mathbf{x}_0$,
and that $f(\mathbf{x}_0) > 0$, then in model (37), as $|\mathbf{h}| \to 0$ and $|\boldsymbol\theta| \to 0$ with $h_1 \ge \theta_1$, conclusion
(b) of Theorem 1 with $V^{1/2}(\mathbf{h}^{(-1)})$ replaced by $\{2^{d-1} \prod_{j=2}^{d} h_j\}^{1/2}$ holds.

The case where $h_1 < \theta_1$ remains; then some of the data in the neighborhood where
the alternative holds is ignored and efficacy is reduced. Under suitable conditions on $W_j$,
$j = 1, \ldots, d$, the optimal $h_1$ equals $\theta_1$; that is, $2h_1$ is the length of the interval where the
$X_1$ term in the additive model (37) is different from zero. Details of the justification can be
constructed by extending the results of Doksum and Schafer (2006) to the $d > 1$ case.

Remark 3. Fan (1992), Fan (1993), and Fan and Gijbels (1996) considered minimax kernel
estimates for models where $Y = \mu(X) + \varepsilon$ in the $d = 1$ case and found that the least favorable
distribution for estimation of $\mu(x_0)$ using asymptotic mean squared error has $Y = \mu_0(X) + \varepsilon$
with

$$\mu_0(x) = \frac{b_n^2}{2}\Bigl[1 - c\Bigl(\frac{x - x_0}{b_n}\Bigr)^2\Bigr]_+ \tag{42}$$

where $b_n = c_0 n^{-1/5}$, for some positive constants $c$ and $c_0$. Similar results were obtained for
the white noise model by Donoho and Liu (1991a,b).

Lepski and Spokoiny (1999), building on Ingster (1982), considered minimax testing using
kernel estimates of $\mu(x)$ and a model with normal errors $\varepsilon$ and $\mathrm{Var}(\varepsilon) \to 0$ as $n \to \infty$. Their
least favorable distribution (page 345) has $\mu_0(x)$ equal to a random linear combination of
functions of the form



$$\Bigl(h^{1/2} \int W^2(t)\, dt\Bigr)^{\cdots}\, W\bigl(\cdots x \cdots\bigr). \tag{43}$$



4 Variable Selection 

Suppose we want to test $H_0$ that $X_1$ and $Y$ are unrelated and wonder whether we should
keep $X_j$, $j \ge 2$, in the model when we construct a t-statistic for this testing problem. In
experiments where confounding variables are possibly present, it would seem unreasonable
to keep or exclude $X_j$ on the basis of power for testing $H_0$ because confounding could lead
to dropping $X_j$ in situations where the relationship between $X_1$ and $Y$ is spurious. For
this reason it would seem more reasonable to base the decision about keeping $X_j$ on the
strength of its relationship to $(X_1, Y)$ and thus we ask whether $X_j$ contributes to accurate
prediction of the response $Y$ conditionally given $X_1$. More generally we consider dropping
several variables and consider fits to $Y$ over subsets of $\mathbb{R}^d$ that vary both in dimension and
size (tuning parameter) within each dimension and we select the dimension and size which
maximizes efficacy in the $X_1$ direction while controlling for spurious correlation.

We define a feature $S_k$ to be a subset of the covariate space which yields the maximal
absolute t-statistic for some $\mathbf{x}_0$. In other words, among all of the subregions over which
linear models are fit, we discard those that do not maximize the absolute t-statistic for some
covariate vector $\mathbf{x}$. The remaining regions are denoted $S_1, S_2, \ldots, S_r$. Assume the features
are placed in decreasing order of the value of absolute t-statistic for the variable of interest.
These features are subsets of $\mathbb{R}^d$ ordered with respect to their relevance to the relationship
between $X_1$ and $Y$. Define

$$S'_k = S_k \cap (S'_1 \cup S'_2 \cup \cdots \cup S'_{k-1})^c \tag{44}$$

with $S'_1 = S_1$. Then $\cup_k S'_k = \cup_k S_k$ and the $S'_k$ are disjoint.
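
The construction in Equation (44) amounts to assigning each point to the first feature, in decreasing order of absolute t-statistic, that contains it. A minimal Python sketch, assuming (for illustration only) that features are axis-aligned boxes given as `(lower, upper)` pairs already sorted by decreasing absolute t-statistic; the function name `disjoint_membership` is hypothetical:

```python
import numpy as np

def disjoint_membership(points, boxes):
    """For each point, return the index k of the disjoint set S'_k containing it,
    or -1 if the point lies in none of the features."""
    points = np.asarray(points, dtype=float)
    label = np.full(len(points), -1)
    for k, (lower, upper) in enumerate(boxes):
        in_k = np.all((points >= lower) & (points <= upper), axis=1)
        # S'_k = S_k minus everything already claimed by S'_1, ..., S'_{k-1}.
        label[in_k & (label == -1)] = k
    return label

# Two overlapping boxes in the plane; the first has the larger |t| by assumption.
boxes = [([0.0, 0.0], [0.6, 0.6]), ([0.4, 0.4], [1.0, 1.0])]
points = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9], [2.0, 2.0]]
labels = disjoint_membership(points, boxes)
```

The point `[0.5, 0.5]` lies in both boxes and is assigned to the first one, which matches the ordering by relevance described above.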

There are competing goals: We want the model to be parsimonious (not include too many
covariates), especially since we are focusing on cases where $d$ is large. But we don't want
to exclude any covariate which, if included, changes the picture of the $(X_1, Y)$ relationship.
Thus if $X_1$ and $X_2$ are closely related, as are $X_2$ and $Y$, there will also be an apparent
relationship between $X_1$ and $Y$. To avoid making false claims regarding the strength of the
relationship between $X_1$ and $Y$, we tentatively include $X_2$, and let the analysis show whether
that weakens the $(X_1, Y)$ relationship.

This analysis of the $(X_1, Y)$ relationship after correcting for other $X$'s can be accomplished
in the following way: Consider the quantity

$$\gamma(x_1, \hat\mu_j) = E\bigl\{[Y - \hat\mu_j(\mathbf{X})]^2 \bigm| X_1 = x_1\bigr\}, \tag{45}$$

the expected squared prediction error when $X_1 = x_1$, where $(\mathbf{X}, Y)$ is independent of $(\mathbf{X}_i, Y_i)$,
$1 \le i \le n$, and $\hat\mu_j(\mathbf{x})$ stands for the model fit for model $j$ based on a subset of covariates
that are restricted to subregion $S$. We seek the subset of the covariates which minimizes
this criterion. If there is no relationship between $X_2$ and $Y$ when $X_1 = x_1$, then $X_2$ will
be excluded based on comparisons of $\gamma(x_1, \hat\mu_j)$ for different models $j$; this could lead to
different covariates included for different values of $x_1$.

This is implemented as follows. Within each $S_k$, $m$ different linear models are fit; these
models differ in which covariates are included, but the notation is consistent in the sense
that "model $j$" always refers to the same list of covariates. Let $\hat\mu_{jk}(\mathbf{x})$ denote the fitted value
at covariate value $\mathbf{x} \in S_k$ from model $j$ fit over $S_k$ and let $\hat\mu_j(\mathbf{x}) = \hat\mu_{jk}(\mathbf{x})$ if $\mathbf{x} \in S'_k$. Each $\mathbf{x}$
lies in exactly one $S'_k$, so this is uniquely defined.

The mean squared prediction error for a model fit is commonly approximated by the
leave-one-out cross validation score based on the fitted value $\hat\mu_{-i}(\mathbf{X}_i)$ from the fit with $(\mathbf{X}_i, Y_i)$
removed. If $\hat\mu$ is a linear model fit using least squares then the cross-validation prediction
error is

$$(Y_i - \hat\mu_{-i}(\mathbf{X}_i))^2 = \Bigl(\frac{Y_i - \hat\mu(\mathbf{X}_i)}{1 - h_i}\Bigr)^2 \tag{46}$$

where $h_i$ is the $i$th diagonal element of the hat matrix from the linear model.
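
The leave-one-out identity in Equation (46) can be checked numerically. The sketch below uses simulated data and a global least squares fit purely for illustration; it compares the hat-matrix shortcut with the brute-force refit that drops each observation in turn.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # design with intercept
y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(size=n)

# Hat matrix H = X (X^T X)^{-1} X^T and ordinary residuals.
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
shortcut = (resid / (1.0 - np.diag(H))) ** 2                  # right side of (46)

# Brute force: refit without observation i and predict it.
brute = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    brute[i] = (y[i] - X[i] @ coef) ** 2

cv_score = shortcut.mean()                                    # leave-one-out CV score
```

The two vectors agree up to floating point error, and the trace of `H` equals the number of fitted coefficients, so the terms $1 - h_i$ sum to the residual degrees of freedom.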

Here, we do not fit a global linear model, but the fitted value $\hat\mu_j(\mathbf{X}_i)$ is the result of the
fit from some linear model. Let $h_{ij}$ denote the diagonal element corresponding to $\mathbf{X}_i$ of the
hat matrix from the linear model fit which gives us $\hat\mu_j(\mathbf{X}_i)$. Then define

$$\hat\gamma_{ij} = \Bigl(\frac{Y_i - \hat\mu_j(\mathbf{X}_i)}{1 - h_{ij}}\Bigr)^2.$$

Note how this automatically adjusts for the differing degrees of freedom in the different
models since $\sum_i (1 - h_{ij}) = n - d - 1$. We can now estimate $\gamma(\cdot, \hat\mu_j)$ by the smooth $\hat\gamma(\cdot, \hat\mu_j)$
of $\hat\gamma_{ij}$ versus the observed value of $X_{i1}$ for each data point $\mathbf{X}_i$, $i = 1, 2, \ldots, n$. For a given
$X_1 = x_1$, the model $j$ within $S_k$ with the smallest $\hat\gamma(x_1, \hat\mu_j)$ is selected. The accompanying figure
shows an example where four models are compared.

Note that we avoid the curse of dimensionality by using features $S_k$ with large values of
the t-statistics in direction $X_1$. If there are a large number of variables, instead of considering
all possible combinations of variables, we can use the backward deletion approach commonly
used in connection with AIC or SBC (Schwarz's Bayesian Criterion).

To summarize, by selecting the features as described above, we are simultaneously selecting
the number of variables to include and the size of candidate neighborhoods for computing
t-statistics in a given direction, here $X_1$. We select the neighborhoods where the t-statistics
are maximized and we use conditional prediction error to select the variables in such a way
that we are protected against using models that produce spurious correlations.

Remark 4. A great number of tests of model assumptions and variable selection procedures are available in a nonparametric setting, e.g., Azzalini, Bowman, and Härdle (1989), Raz (1990), Eubank and LaRiccia (1993), Härdle and Mammen (1993), Bowman and Young (1996), Hart (1997), Stute (1997), Lepski and Spokoiny (1999), Fan, Zhang, and Zhang (2001), Polzehl and Spokoiny (2002), Zhang (2003), Samarov, Spokoiny, and Vial (2005), among others. One class of tests of whether the $j$th variable has an effect looks at the difference between the mean $\mu(\mathbf{x})$ of $Y$ given all the x's and the mean $\mu_{-j}(\mathbf{x}_{-j})$ of $Y$ given all but the $j$th variable. Specifically, for some weight functions $w(\mathbf{X})$, measures of the form



$$E\{[\mu(\mathbf{X}) - \mu_{-j}(\mathbf{X}_{-j})]^2\, w(\mathbf{X})\}$$

are considered (Doksum and Samarov (1995), Aït-Sahalia, Bickel, and Stoker (2001)). Similar measures compare prediction errors of the form

$$E\{[Y - \mu(\mathbf{X})]^2\, w(\mathbf{X})\} \quad \text{and} \quad E\{[Y - \mu_{-j}(\mathbf{X}_{-j})]^2\, w(\mathbf{X})\}.$$

Our efficacy measure is a version of $\mu(\mathbf{x}) - \mu_{-j}(\mathbf{x}_{-j})$ adjusted for its estimability while our measure of spurious correlation is a conditional version of the above prediction errors.



5 Critical Values 

First consider testing $H_0$: "$X_1$ is independent of $X_2, X_3, \ldots, X_d, Y$" against the alternative $H_1 = H_1(\mathbf{x}_0)$ that $\mu(\mathbf{x})$ is not constant as a function of $x_1$ in a neighborhood of a given covariate vector $\mathbf{x}_0$ from a targeted unit of interest. Our procedure uses the data to select the most efficient t-statistic adjusted for spurious correlation, say $t_1(\mathbf{x}_0, (\mathbf{X}, \mathbf{Y}))$, where $(\mathbf{X}, \mathbf{Y}) = ((X_{ij})_{n \times d}, (Y_i)_{n \times 1})$ are the data. Let $X_1^*$ be a random permutation of $(X_{11}, \ldots, X_{n1})$ and $\mathbf{X}^* = (X_1^*, X_2, \ldots, X_d)$ where $X_j = (X_{1j}, \ldots, X_{nj})^T$. Then, for all $c > 0$,

$$P_{H_0}(|t_1(\mathbf{x}_0, (\mathbf{X}, \mathbf{Y}))| \le c) = P(|t_1(\mathbf{x}_0, (\mathbf{X}^*, \mathbf{Y}))| \le c).$$

Next, we select $B$ independent random permutations $\mathbf{X}^*_1, \ldots, \mathbf{X}^*_B$ and use the $(1 - \alpha)$ sample quantile of $|t_1(\mathbf{x}_0, (\mathbf{X}^*_k, \mathbf{Y}))|$, for $k = 1, 2, \ldots, B$, as the critical value. As $B \to \infty$, this quantile converges in probability to the level $\alpha$ critical value. Figure 1 gives an example of the simulated distribution of $|t_1(\mathbf{x}_0, (\mathbf{X}, \mathbf{Y}))|$ under the null hypothesis for $\mathbf{x}_0 = (0.4, 0.4, 0.4)$. Here $B = 500$, and the model is the one described in Section 6 and stated in Equation (48).
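The permutation recipe above can be sketched in a few lines. This is only an illustration under stated assumptions: the paper's adaptively selected statistic $t_1$ is replaced here by a plain global least squares t-statistic (`slope_t_stat`, a hypothetical stand-in), and the data are a toy sample in which the null hypothesis holds. Only the column for $X_1$ is permuted, as in the text.

```python
import numpy as np

def slope_t_stat(X, y, j=0):
    """t-statistic for coefficient j in an OLS fit with intercept.
    A simple stand-in for the paper's selected statistic t_1."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma2 = resid @ resid / (A.shape[0] - A.shape[1])
    cov = sigma2 * np.linalg.inv(A.T @ A)
    return beta[j + 1] / np.sqrt(cov[j + 1, j + 1])

def permutation_critical_value(X, y, stat, alpha=0.05, B=500, seed=0):
    """(1 - alpha) sample quantile of |stat| over B permutations of column 0."""
    rng = np.random.default_rng(seed)
    null_stats = np.empty(B)
    for k in range(B):
        X_star = X.copy()
        X_star[:, 0] = rng.permutation(X[:, 0])   # permute X_1 only
        null_stats[k] = abs(stat(X_star, y))
    return np.quantile(null_stats, 1 - alpha), null_stats

# Toy data (assumption): Y depends on X_2 only, so H_0 holds for X_1
rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 1] + rng.normal(scale=0.3, size=200)
crit, null_stats = permutation_critical_value(X, y, slope_t_stat, alpha=0.05, B=500)
reject = abs(slope_t_stat(X, y)) > crit
```

Any statistic with the same call signature can be passed as `stat`, so the same loop would apply to a maximized local t-statistic; only the cost per permutation changes.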
Figure 1: The simulated distribution of the maximum absolute t-statistic at $\mathbf{x}_0 = (0.4, 0.4, 0.4)$ using the model described in Section 6, Equation (48). [Horizontal axis: maximum absolute t-statistic.]


Note that although $t_1(\mathbf{x}_0, (\mathbf{X}, \mathbf{Y}))$ is selected using absolute values of efficacy, we can still perform valid one-sided tests by following the above procedures without the absolute values. In this case the alternative is that $\mu(\mathbf{x})$ is increasing (or decreasing) as a function of $x_1$ in a neighborhood of $\mathbf{x}_0$.

Next consider testing $H_0$ against the alternative that $X_1$ is not independent of $X_2, \ldots, X_d, Y$. If $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(g)}$ is a set of grid points we can use the sum, sum of squares, maximum, or some other norm, of

$$|t_1(\mathbf{x}^{(1)}, (\mathbf{X}, \mathbf{Y}))|, \ldots, |t_1(\mathbf{x}^{(g)}, (\mathbf{X}, \mathbf{Y}))|$$

to test $H_0$ against this alternative. Instead of grid points, we may also use $X_1, \ldots, X_n$ or a subsample thereof. Again the permutation distribution of these test statistics will provide critical values.

Finally, consider the alternative that $\mu(\mathbf{x})$ is monotone in $x_1$ for all $x_1 \in \mathbb{R}$. We would proceed as above without the absolute values, using a left-sided test for monotone decreasing and a right-sided test for monotone increasing; see Hall and Heckman (2000), who considered the $d = 1$ case.
6 The Analysis Pipeline; Examples

In this section we describe practical implementation of the data analysis pipeline developed in the previous sections, using both simulated and real data. The simulations are built around the following functions, plotted in Figure 2:

$$f_1(x) = 4x - 2 + 5\exp(-64(x - 0.5)^2)$$
$$f_2(x) = 2.5x\exp(1.5 - x)$$
$$f_3(x) = 3.2x + 0.4$$

Figure 2: The three functions which will be used in the simulations. [Horizontal axis: x, from 0.0 to 1.0.]

The algorithm can be described as follows. Fix a covariate vector $\mathbf{x}_0 \in [0, 1]^3$ and a neighborhood which includes $\mathbf{x}_0$, but is not necessarily centered at $\mathbf{x}_0$. A linear model is fit over this region. The slope in the $X_1$ direction, $\beta_1$, is estimated using least squares, giving $\hat\beta_1$. The t-statistic for $\hat\beta_1$ follows easily. One can now imagine repeating this for all possible neighborhoods, and finding the neighborhood which results in the largest absolute t-statistic for $\hat\beta_1$. We then imagine doing this for all possible values of $\mathbf{x}_0$. In practice, a reasonable subset of all possible neighborhoods is chosen, and models are fit over these regions. Here, we lay down a grid of values for each of the covariates, based on evenly spaced quantiles of the data. The number of grid points can be different for each covariate; we choose to have a larger number for the covariate of interest (we use 15) than for the other covariates (we use 5). Thus, we have an interior grid made up of $15 \times 5 \times 5 = 375$ covariate vectors. A neighborhood is formed by choosing two of these points and using them as the corners of the region. Thus, there are a total of $375 \times 374/2 = 70125$ potential neighborhoods; some will be discarded due to sparsity of data.
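The enumeration of candidate neighborhoods and the local t-statistic can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`quantile_grid`, `candidate_neighborhoods`, `local_t_stat`), the sparsity threshold `min_points=20`, and the toy data are all assumptions. To keep the run short, the example fits a smaller 5 x 3 x 3 grid; the 15 x 5 x 5 grid of the text yields the 70125 pairs computed on the `n_pairs_paper` line.

```python
import itertools
import math
import numpy as np

def quantile_grid(x, m):
    """m evenly spaced interior quantiles of x."""
    return np.quantile(x, np.arange(1, m + 1) / (m + 1))

def candidate_neighborhoods(X, grid_sizes):
    """Yield (lower corner, upper corner) for every pair of grid points."""
    grids = [quantile_grid(X[:, j], m) for j, m in enumerate(grid_sizes)]
    points = np.array(list(itertools.product(*grids)))
    for p, q in itertools.combinations(points, 2):
        yield np.minimum(p, q), np.maximum(p, q)

def local_t_stat(X, y, lo, hi, min_points=20):
    """t-statistic for the X_1 slope of a linear fit inside the box [lo, hi].
    Returns None when the box holds too few points (assumed threshold)."""
    inside = np.all((X >= lo) & (X <= hi), axis=1)
    n_in = int(inside.sum())
    if n_in < min_points:
        return None
    A = np.column_stack([np.ones(n_in), X[inside]])
    beta, *_ = np.linalg.lstsq(A, y[inside], rcond=None)
    resid = y[inside] - A @ beta
    sigma2 = resid @ resid / (n_in - A.shape[1])
    se = np.sqrt(sigma2 * np.linalg.inv(A.T @ A)[1, 1])
    return beta[1] / se

n_pairs_paper = math.comb(15 * 5 * 5, 2)   # 70125 pairs for the grid in the text

# Toy data (assumption) and a reduced grid so the example runs quickly
rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 3))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=500)
fits = []
for lo, hi in candidate_neighborhoods(X, (5, 3, 3)):
    t = local_t_stat(X, y, lo, hi)
    if t is not None:
        fits.append((lo, hi, t))
```

Pairs of grid points that share a coordinate give a box of zero width in that coordinate; in this sketch such boxes contain almost no data and are dropped by the sparsity check.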
We consider a simple case first. Set

$$Y = f_1(X_1) + f_2(X_2) + f_3(X_3) + \epsilon \qquad (48)$$

where $\epsilon$ is normal with mean zero and variance $\sigma^2 = 0.02$. A sample of size $n = 1000$ is taken. The random variables $X_1, X_2, X_3$ are i.i.d. $U(0, 1)$. Figure 3(a) shows the raw results of the analysis. For each of the local linear models, there is one line on the plot. The value on the vertical axis gives the t-statistic for $\hat\beta_1$. The range of values along the horizontal axis represents the range of values of the variable of interest for that neighborhood. The shade of the line indicates the proportion of the covariate space in the other two variables covered by that neighborhood. For example, the darker lines indicate regions which, regardless of their width in the $X_1$ direction, cover most of the $X_2$ and $X_3$ space.
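For readers who wish to reproduce the setup, the simple case of Equation (48) can be simulated as below. This is a sketch: the function name `simulate_simple_case` and the random seed are arbitrary choices made here, while the three functions, the sample size, and the error variance are those stated in the text.

```python
import numpy as np

def f1(x):
    return 4 * x - 2 + 5 * np.exp(-64 * (x - 0.5) ** 2)

def f2(x):
    return 2.5 * x * np.exp(1.5 - x)

def f3(x):
    return 3.2 * x + 0.4

def simulate_simple_case(n=1000, sigma2=0.02, seed=0):
    """Draw (X, Y) with X_1, X_2, X_3 i.i.d. U(0, 1) and
    Y = f1(X_1) + f2(X_2) + f3(X_3) + normal noise of variance sigma2."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 3))
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)
    y = f1(X[:, 0]) + f2(X[:, 1]) + f3(X[:, 2]) + eps
    return X, y

X, y = simulate_simple_case()
```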




Figure 3: (a) Raw t-stat plot; (b) t-stat plot. Figure 3(a) is the plot of t-statistics for each neighborhood fit for the simple case of Equation (48). Figure 3(b) is the same, except it only shows those neighborhoods which maximize the absolute t-statistic for some covariate vector $\mathbf{x}_0$.



Figure 3(b) is the same as Figure 3(a) except that the only regions plotted are those that yield the maximum absolute t-statistic for some $\mathbf{x}_0$. A convenient computational aspect of this approach is that wide ranges of covariate values $\mathbf{x}_0$ share the same region: we can describe the procedure as fixing $\mathbf{x}_0$ and finding the optimal region for that $\mathbf{x}_0$, but in fact we only need to fit all of these 70125 candidate models, calculate the t-statistic for each, and then discard any fit which is not the maximum for some $\mathbf{x}_0$. In this example there are 18 regions remaining following this process. The regions represented in this plot are the previously defined features, $S_1, S_2, \ldots, S_r$. We call this plot the "t-statistic plot for variable of interest $X_1$."
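The discarding step described above can be sketched as a single pass over a set of target vectors. This is an illustration only: `select_features` and the tiny hand-made example are hypothetical, and the choice of which target vectors $\mathbf{x}_0$ to check (for example, the observed data points) is an assumption rather than something fixed by the text.

```python
import numpy as np

def select_features(lows, highs, t_stats, targets):
    """Indices of regions that attain the largest |t| among the regions
    containing at least one target vector x_0."""
    lows, highs = np.asarray(lows, float), np.asarray(highs, float)
    abs_t = np.abs(np.asarray(t_stats, float))
    winners = set()
    for x0 in np.asarray(targets, float):
        contains = np.all((lows <= x0) & (x0 <= highs), axis=1)
        if contains.any():
            idx = np.flatnonzero(contains)
            winners.add(int(idx[np.argmax(abs_t[idx])]))
    return sorted(winners)

# Three hand-made regions in [0, 1]^3 that differ only in their X_1 range
lows = [[0.0, 0, 0], [0.4, 0, 0], [0.0, 0, 0]]
highs = [[0.5, 1, 1], [1.0, 1, 1], [1.0, 1, 1]]
t_stats = [5.0, -8.0, 3.0]
targets = [[0.2, 0.5, 0.5], [0.45, 0.5, 0.5], [0.9, 0.5, 0.5]]
features = select_features(lows, highs, t_stats, targets)
print(features)  # [0, 1]
```

Region 2 covers every target but never has the largest absolute t-statistic, so it is dropped; this mirrors how the 70125 candidates reduce to a small number of features.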

A more useful way of plotting the remaining regions is shown in Figure 4. We will refer to this graph as the "feature plot for variable of interest $X_1$." In each plot, there is one light, horizontal line for each feature; note that they are labeled with numbers going from 1 to 18. The vertical axis gives $\hat\beta_1$ for that local model. Each dot is an observed data point, and the shade of the dots on one line (i.e., in one region) again represents the extent of that region in the other two covariates.

From this plot, one can pick out dependencies in the data. Consider the regions labeled
"1," "2," and "3." For each of these, the range of values of $X_1$ in the region is approximately
0.5 to 0.7 (look at the third plot). This is capturing the steep downslope in $f_1(x)$ for $x$ in
that range. But, the dependence between X 2 and Y (characterized here by the slope fa) 
varies with Xi- Ignoring this fact when modeling Y as a function of X\ would mask this 
downslope. This is seen clearly when we look at these three regions in the second plot in
Figure 4. The three regions correspond to different ranges of values of $X_2$. Region "1" is
approximately 0.5 to 0.75, where $f_2(x)$ is starting to level off, region "2" extends up to 1.0,
where $f_2(x)$ is almost flat, and region "3" goes up to 0.5, where $f_2(x)$ is the steepest.

Since the relationship between $X_3$ and $Y$ is linear, there is no such pattern to be found
in the last of the three plots. The light horizontal lines in this first plot show that our 
procedure chooses the largest possible bandwidths in the direction Xi, that is, the response 
Y is modeled to be linear in this direction, as we know it should be. 



Figure 5(a) shows the estimates of the expected squared prediction error $\gamma(\cdot, \hat\mu_j)$ for each of four models, relative to this quantity for the "null" model, the model which only uses the mean of $Y$ within that region to predict the response. We call this plot the "CV plot." In this case, the best choice is to use all three predictors, as evidenced by the solid black line being the lowest for all values of $X_1$. Figure 5(b) (the "slope plot") shows, for each $X_1$ value in the data set, the estimated slope $\hat\beta_1$ for the region $S_k$ in which that observation lies. In practice, if the variable selection procedure chose a simpler model for a particular subset $S_k$, that model would be used in the slope plot.



6.1 Other Cases 

We considered alterations to the simple simulation model described in the previous section. First, consider a case with $Y = f_1(X_1) + f_2(X_2) + \epsilon$, so that $Y$ is not a function of $X_3$, but now take $(X_1, X_3)$ to be bivariate normal, each with mean 0.5 and SD 1, and with correlation $\sqrt{0.5}$. $X_2$ is still $U(0, 1)$, and independent of $X_1$ and $X_3$. Again, $n = 1000$. Figure 6(a) shows the CV plot for this case. Note that since conditional on $X_1$, $X_3$ and $Y$ are independent, the variable selection procedure is indicating that $X_3$ could be excluded. Contrast this with the second case where $(X_1, X_3)$ have the same bivariate normal distribution, but now $Y = f_2(X_2) + f_3(X_3) + \epsilon$; see Figure 6(b) for the CV plot. Here, the best choice is to include all three variables since excluding $X_3$ would lead to misleading conclusions regarding the strength of the relationship between $X_1$ and $Y$.

6.2 Analysis of Currency Exchange Data 

The original motivation for this study was to address a question in the study of currency exchange rates regarding the nature of the relationship between the volume of trading and the return. Data were obtained on the Japanese Yen to U.S. dollar exchange rate for the period of January 1, 1992 to April 14, 1995, a total of 1200 trading days. The response variable used was today's log volume with three covariates: 1) today's log return, 2) yesterday's log volume, and 3) yesterday's log return. The first of these, today's log return, is set as the covariate of interest. Figure 7 shows the feature plot for this data set. One interesting result is that when today's log return is positive, the coefficient for today's log return is positive; see feature 8 at the top of the plot. And when today's log return is negative, those coefficients are mostly negative; see features 1, 3, 4, and so forth, on the left side of the plot. This confirms a prediction of Karpoff (1987).



When variable selection is applied, we see that there is evidence that yesterday's log return is not important; see Figure 8(a). Finally, the slope plot (Figure 8(b)) shows again how the estimated slope $\hat\beta_1$ for today's log return abruptly switches from negative to positive once that variable becomes positive. This is an important finding that may have been missed under a standard multiple regression fit to this surface. This becomes clearer when inspecting the level plot shown in Figure 8(c). The plot shows one dot for each vector $\mathbf{x}$ in the data set; the horizontal axis gives today's log return, and the vertical axis is the fitted value $\hat\mu(\mathbf{x})$. The superimposed solid curve is a smooth of the plot. For comparison, also plotted (as a dashed line) is the level curve for the $d = 1$ case where the only covariate is today's log return, as in Karpoff (1987).



Figure 8(d) is the level plot from the same analysis, except now making yesterday's log 
volume the response, and using today's log volume as one of the explanatories. Once again, 
we see some evidence of the "Karpoff effect." An interesting aspect of this plot is that when 
all three covariates are included in the analysis, the slope on the left of the level curve is 
significantly smaller than the slope on the right. This seems to imply that there would be 
a way to exploit yesterday's log volume in an effort to predict whether today's log return 
will be positive or negative. But, this artifact is removed by excluding today's log volume 
as one of the covariates: See the level curve corresponding to "Two Covariates." Using only 
yesterday's information, it now does not seem possible to utilize yesterday's log volume to 
predict if today's log return will be positive or negative. 



7 Appendix 



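7.1 Proof of Lemma 1

The following implies Lemma 1.

Lemma 2. Let $\mathrm{RSS}_{\mathbf{h}} = \sum_{\mathbf{h}} [Y_i - \hat\mu_L(\mathbf{X}_i)]^2$; then

$$E_{\mathbf{h}}(\mathrm{RSS}_{\mathbf{h}}) = [nV(\mathbf{h}) - d]\sigma^2 + nV(\mathbf{h})\, E_{\mathbf{h}}[\mu(\mathbf{X}) - \mu_L(\mathbf{X})]^2. \qquad (49)$$

Proof. If we condition on $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_n)$, then the events $\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)$ are no longer random. We can adapt Hastie (1987), equation (16), to find

$$E_{\mathbf{h}}(\mathrm{RSS}_{\mathbf{h}} \mid \mathbf{X}) = n_{\mathbf{h}}\sigma^2 + \sum_{\mathbf{h}} [\mu(\mathbf{X}_i) - E(\hat\mu_L(\mathbf{X}_i))]^2 - d\sigma^2. \qquad (50)$$

Here $E_{\mathbf{h}}(\hat\mu_L(\mathbf{X}_i)) = \mu_L(\mathbf{X}_i)$ and Lemma 2 follows by the iterated expectation theorem. Lemma 1 is a special case with $\mu$ and $\mu_L$ computed under the null hypothesis. □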
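The iterated-expectation step from (50) to (49) can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly in the proof: that $V(\mathbf{h}) = P(\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0))$, as implied by the normalization in (51), and that $n_{\mathbf{h}}$ is the number of observations falling in $N_{\mathbf{h}}(\mathbf{x}_0)$. The indicator notation $1[\cdot]$ is used as in (51).

```latex
\begin{align*}
E(n_{\mathbf{h}})
  &= \sum_{i=1}^{n} P\big(\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)\big)
   = nV(\mathbf{h}), \\
E\Big\{ \sum_{\mathbf{h}} [\mu(\mathbf{X}_i) - \mu_L(\mathbf{X}_i)]^2 \Big\}
  &= \sum_{i=1}^{n} E\Big\{ 1[\mathbf{X}_i \in N_{\mathbf{h}}(\mathbf{x}_0)]\,
     [\mu(\mathbf{X}_i) - \mu_L(\mathbf{X}_i)]^2 \Big\} \\
  &= nV(\mathbf{h})\, E_{\mathbf{h}}[\mu(\mathbf{X}) - \mu_L(\mathbf{X})]^2 .
\end{align*}
```

Taking the expectation of (50) over $\mathbf{X}$, replacing $E(\hat\mu_L(\mathbf{X}_i))$ by $\mu_L(\mathbf{X}_i)$, and substituting these two identities gives $[nV(\mathbf{h}) - d]\sigma^2 + nV(\mathbf{h}) E_{\mathbf{h}}[\mu(\mathbf{X}) - \mu_L(\mathbf{X})]^2$, which is the right-hand side of (49).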
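7.2 Proof of Proposition

Proof. We compute expected values using the joint density

$$f_{\mathbf{h}}(\mathbf{x}) = f(\mathbf{x})\, 1[\mathbf{x} \in N_{\mathbf{h}}(\mathbf{x}_0)] / V(\mathbf{h}) \qquad (51)$$

of $\mathbf{X}$ given $\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)$. Thus, with $k = j$,

$$\mathrm{Cov}_{\mathbf{h}}\!\left(X_j, W_j\!\left(\frac{X_j - x_{0j}}{\theta_j}\right)\right) = \int [x_j - E_{\mathbf{h}}(X_j)]\, W_j\!\left(\frac{x_j - x_{0j}}{\theta_j}\right) f_j^{\mathbf{h}}(x_j)\, dx_j, \qquad (52)$$

where

$$f_j^{\mathbf{h}}(x_j) = \int f_{\mathbf{h}}(\mathbf{x})\, d\mathbf{x}_{(-j)}, \qquad (53)$$

with $\mathbf{x}_{(-j)} = (x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_d)^T$, is the marginal density of $X_j$ given $\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)$. The change of variables $s_j = (x_j - x_{0j})/\theta_j$ gives

$$\mathrm{Cov}_{\mathbf{h}}\!\left(X_j, W_j\!\left(\frac{X_j - x_{0j}}{\theta_j}\right)\right) = \theta_j \int [x_{0j} - E_{\mathbf{h}}(X_j) + \theta_j s_j]\, W_j(s_j)\, f_j^{\mathbf{h}}(x_{0j} + \theta_j s_j)\, ds_j. \qquad (54)$$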

A similar argument for fc 7^ j shows that 7fcCov h (X,-, Y) = O(maxk{ / yk0k})- Now (a) follows 
because the terms in EFFx other than 7/ c Cov h (X ? -, y) are fixed as 7^ — > 0. Finally, (b) 
follows from Proposition [31 □ 

7.3 Proof of Theorem 1

Proof. Because of the independence of $X_1$ and $X_2, \ldots, X_d$,

$$n^{-1/2}\, \mathrm{EFF}_1(\mathbf{h}; \mathbf{x}_0) = [\sigma^2 + \sigma^2(\mathbf{h}^{(-1)})]^{-1/2}\, V^{1/2}(h_1)\, V^{1/2}(\mathbf{h}^{(-1)})\, [\mathrm{SD}_{h_1}(X_1)]^{-1}\, \mathrm{Cov}_{h_1}(X_1, Y), \qquad (55)$$

where $V(h_1) = P(x_{01} - h_1 \le X_1 \le x_{01} + h_1)$. The result now follows from the proof of Theorem 2.1 in Doksum and Schafer (2006). □



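7.4 Proof of Theorem 2

Proof. The proof can be constructed by using the fact that for small $|\mathbf{h}|$, $X_1, \ldots, X_d$ given $\mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)$ are approximately independent with $U(x_{0j} - h_j, x_{0j} + h_j)$, $1 \le j \le d$, distributions. This can be seen by Taylor expanding $f(\mathbf{x})$ around $\mathbf{x}_0$ and noting that

$$P(\mathbf{X} \in A \mid \mathbf{X} \in N_{\mathbf{h}}(\mathbf{x}_0)) = \int_{A \cap N_{\mathbf{h}}(\mathbf{x}_0)} f(\mathbf{x})\, d\mathbf{x} \Big/ \int_{N_{\mathbf{h}}(\mathbf{x}_0)} f(\mathbf{x})\, d\mathbf{x} \approx \int_{A \cap N_{\mathbf{h}}(\mathbf{x}_0)} d\mathbf{x} \Big/ \int_{N_{\mathbf{h}}(\mathbf{x}_0)} d\mathbf{x}. \qquad (56)$$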



A similar approximation applies to moments. Now use the proof of Theorem [T] with appro- 
Figure 4: The feature plot for variable $X_1$, for the simple case.

Figure 5: More results from the analysis of the simple model. Figure 5(a) is the CV plot, comparing competing models for the purposes of variable selection. Figure 5(b) is the slope plot for variable $X_1$.

Figure 6: Examples from analyses of extensions of the simple model. Figure 6(a) is the CV plot for the case where $X_1$ and $X_3$ are dependent, but $Y$ is not a function of $X_3$, comparing competing models for the purposes of variable selection. Figure 6(b) is the CV plot for the case where $X_1$ and $X_3$ are dependent, but $Y$ is a function of $X_3$, not of $X_1$.

Figure 7: The feature plot for today's log volume for currency exchange data.

Figure 8: Results from the analysis of the currency exchange data: the CV plot (Figure 8(a)), the slope plot (Figure 8(b)), and the level plot (Figure 8(c)) from the analysis where the response is today's log volume. Figure 8(d) is the level plot from the analysis using yesterday's log volume as the response.